The below is a continuation of the analysis begun in “project_analysis_report_p1.html”:

Step 6: Highly Variable Feature Selection:

Finding and displaying the most highly variable features in each subtype elucidates high cell-to-cell variation, and directly models the mean-variance relationship. The top 10 highly variable features are labelled.

#set number of highly variable features to be considered/analyzed (n_highvarfeats) and 
#labeled (n_labels) 
n_highvarfeats = 2000
n_labels = 10

#enables overlap of labels on graph
options(ggrepel.max.overlaps = Inf) 

data <- FindVariableFeatures(data, selection.method = "vst", nfeatures = n_highvarfeats)
expr_data_high <- head(VariableFeatures(data), n_labels)
expr_data_plot <- VariableFeaturePlot(data)
LabelPoints(plot = expr_data_plot, points = expr_data_high, repel = TRUE, xnudge = 0, 
ynudge = 0)

For example, one prevalent biomarker to note is MUCL1, a well studied breast cancer oncoprotein associated with tumor aggressiveness in HER2+ samples and lackthereof in ER+ samples, and has been previously identified as a potential target for its therapeutic properties (Link).

Another example is the presence of the IGHG family in ER+ samples and lackthereof in TNBC samples. These immunoglobulin families have been previously identified to promote chemosresistance in breast cancer, which directly corresponds with the respective lethality rates associated with ER+ and TNBC samples (Link).

The identification of biomarkers such as MUCL1 and the IGHG families helps us validate our analysis thus far and double-check its methodology with biological relevance.

Step 7: Scaling Data:

The top 2000 highly variable features identified in Step 6 were taken into downstream analysis to scale the data via linear transform in order to regress out unwanted sources of variation including technical noise and batch effects, as well as to prepare the data into a format that is more easily manipulable for performing PCA and clustering.

#scale data
data_genes <- rownames(data)
data <- ScaleData(data, features = data_genes)

Step 8: Linear Dimension Reduction:

After scaling the data I applied linear dimensionality reduction in the form of Principal Component Analysis (PCA) to visualize regulatory correlation patterns in marker expression within their respective subtypes. This stores 5 (by default) different PCAs containing the largest variances (both positive and negative) in gene expression in decreasing order. For example, PCA1 contains the largest source of variation, PCA2 contains the second largest source, etc. Spatial relationships based on PCA features are also displayed in order to identify trends such as patient-specific batch effects and general patters across subtypes.

#Perform PCA dimension reduction
data <- RunPCA(data, features = VariableFeatures(object = data))

#set the number of PCA plots to be used to visualize gene expression variance
dim_min = 1
dim_max = 2

#Visualizes the top sets of genes that are associated with reduction components of PCA
VizDimLoadings(data, dims = dim_min:dim_max, reduction = "pca")

#Graph of output of PCA where each point is a cell in position based on its reduction component
DimPlot(data, reduction = "pca")

Note specific markers of interest within each subtype, including SPARC and Biglycan (BGN).

SPARC, a glycoprotein that mediates interactions between cells and their extracellular surroundings has been found to play a potential role in tumor growth and metastasis (Link). This is done by enabling tumor cells to interact with other stromal cells and the ECM. SPARC is downregulated in ER+ but upregulated in both HER2+ and TNBC, correlating with subtype lethality data and the analysis done so far and displayed below.

BGN is another extracellular protein that has been identified due to its upregulation being linked to poor prognosis (Link). This protein is also found to be upregulated in both HER2+ and TNBC subtypes, and downregulated in ER+; once again corresponding with lethality data linked to these three subtypes and the PCA below.

Step 9: Determine Data Dimensionality

In preparation for performing clustering and non-linear dimensionality reduction, I want to identify the number of dimensions/principal components from my data to include. By plotting an elbow plot for each sample subtype I can directly observe this minimum PC threshold. In order to keep analysis as consistent as possible across subtypes, PC 9 was chosen as the cutoff for each sample subtype in order to ensure a balance of maintaining the most statistically significant markers while still removing any potential technical noise.

#can use an elbowplot to identify how many components to include
ElbowPlot(data) #indicates an "elbow" around PC 9

#used to determine dimensions for clustering and UMAP
elbow_dims = 9 

Step 10: Clustering:

Using the range of dimensions identified using the previous elbow plots, we can perform graph-based clustering using k-nearest neighbor euclidean distance and a Louvain algorithm. Edges are drawn between cells of similar gene/feature expression to group cell types and enable ease of spatial relationship identification.

Clustering resolution is dependent upon the number of cell per sample, but 0.4-1.2 is generally recommended for ~3k cells. As a result, a clustering resolution of 0.5 is applied to all subtypes.

#Performs graph-based clustering via K-nearest neighbor (euclidean distance)
data <- FindNeighbors(data, dims = 1:elbow_dims) #KNN graph
data <- FindClusters(data, resolution = 0.5) #Louvain algorithm

Step 11: Non-linear Dimensionality Reduction:

Non-linear dimensionality reduction (via UMAP) is now applied to preserve local distances between cell relationships and ensures cells co-localize based on gene expression. This method is less ideal for identifying global relationships, but still elucidates patterns across our three specified BC subtypes.

Having already clustered our data and created a UMAP to visualize these spatial relationships within each subtype, we can identify the top 10 upregulated and downregulated “markers” or genes/features per cluster. These markers are then used to cross-validate their prevalence and role within the tumor microenvironment using the databases CellMarker 2.0 and The Human Protein Atlas.

From cross-checking the statistical significance of specific markers in each cluster with the databases listed above, we can begin to label each cluster with their respective cell type. This provides us with a holistic picture of the TME for each respective BC subtype and draw trends across population patterns.

#UMAP creation (unlabeled)
data <- RunUMAP(data, dims = 1:elbow_dims)
DimPlot(data, reduction = "umap", label = TRUE)

#compare differentially expressed features and categorize as "markers"
data_markers <- FindAllMarkers(data, only.pos = FALSE, min.pct = 0.25, logfc.threshold = 0.25)

#identify largest log fold changes for upregulated genes per cluster
top_markers <- data_markers %>%
  filter(p_val_adj < 0.05) %>%
  group_by(cluster) %>%
  arrange(desc(avg_log2FC)) %>%
  slice_head(n = 10) %>%
  ungroup()
#view(top_markers)

#identify largest log fold changes for upregulated genes per cluster
bottom_markers <- data_markers %>%
  filter(p_val_adj < 0.05) %>%
  group_by(cluster) %>%
  arrange(avg_log2FC) %>%
  slice_head(n = 10) %>%
  ungroup()
#view(bottom_markers)

Within the HER2+ subtype specifically we can observe a comparable population of Naive CD4+ T cells and CD8+ T cells as in the ER+ subtype. Furthermore, we see a growing population of M2 macrophages as the lethality of the cancer subtype increases from ER+ to TNBC, as well as a growing population of canonical NK cells which dominate the TNBC subtype.

In the ER+ subtype UMAP we can see a significantly large population of Cancer-associated Fibroblasts, or CAFs, which have been shown to initiate a paracrine signaling pathway in ER+ BC cells and induces a TME that regulates tumor progression by enhancing proliferation (Link). This is further supported by the prevalence of CD24 that acts as a potential biomarker for ER+ BC because of its role in cell migration and proliferation.

Additionally, there is a substantial population of CD8+ NK cells in both ER+ and TNBC samples, which are known to be more cytolytic and anti-tumoral. These findings are also largely supported in literature as immature NK cells are prevalent in TNBC and linked to its poor overall survival rate (Link). Lastly, TNBC consists of the largest M2 macrophage population, which has been highlighted in past studies demonstrating the responsibilities of tumor-associated macrophages (TAMs) that express the M2 phenotype with sufficient immunosuppressive activity, resulting in a pro-tumor role in TNBC (Link).

These findings illustrate a spectrum of gene expression profiles and immune cell populations, pinpointing differentially expressed markers that may play pivotal roles in the progression of breast cancer. This segment of the study established a foundational framework for subsequent immune system composition analyses, such as gene set enrichment analysis (GSEA), immune system composition analysis, and/or pathway analysis. These analyses would help create a more holistic overview of the TME for each BC subtype by solidifying the significance of each subtype-specific cell cluster and their regulatory pathways involved.